Exploring the Data

When we want to explore the data, it is good practice to view the top n rows of a file rather than loading and printing the file entirely. Here, we start exploring the data using helpful functions such as “head”, “tail”, “str”, and “dim”:

salary <- read.csv("./salary_data.csv", strip.white = T, na.strings = "?")

# View the first 6 rows of data
head(salary)
# View the last 6 rows of data
tail(salary)
# Check the class of salary
class(salary)
## [1] "data.frame"
# Check the dimensions of salary
dim(salary)
## [1] 32561    15
# View the column names of salary
colnames(salary)
##  [1] "age"             "workclass"       "fnlwgt"          "education"
##  [5] "education.num"   "martital.status" "occupation"      "relationship"
##  [9] "race"            "sex"             "capital.gain"    "capital.loss"
## [13] "hours.per.week"  "native.country"  "salary.class"

It is always recommended to understand the structure of the data you are working with and to display a summary of it. To do so, we can use the following functions:

# Check the structure of salary
str(salary)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age            : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass      : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
##  $ fnlwgt         : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education      : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education.num  : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ martital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation     : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
##  $ relationship   : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race           : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex            : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital.gain   : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital.loss   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native.country : Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
##  $ salary.class   : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
# View a summary of salary
summary(salary)
##       age                   workclass         fnlwgt
##  Min.   :17.00   Private         :22696   Min.   :  12285
##  1st Qu.:28.00   Self-emp-not-inc: 2541   1st Qu.: 117827
##  Median :37.00   Local-gov       : 2093   Median : 178356
##  Mean   :38.58   State-gov       : 1298   Mean   : 189778
##  3rd Qu.:48.00   Self-emp-inc    : 1116   3rd Qu.: 237051
##  Max.   :90.00   (Other)         :  981   Max.   :1484705
##                  NA's            : 1836
##         education     education.num                martital.status
##  HS-grad     :10501   Min.   : 1.00   Divorced             : 4443
##  Some-college: 7291   1st Qu.: 9.00   Married-AF-spouse    :   23
##  Bachelors   : 5355   Median :10.00   Married-civ-spouse   :14976
##  Masters     : 1723   Mean   :10.08   Married-spouse-absent:  418
##  Assoc-voc   : 1382   3rd Qu.:12.00   Never-married        :10683
##  11th        : 1175   Max.   :16.00   Separated            : 1025
##  (Other)     : 5134                   Widowed              :  993
##            occupation            relationship                   race
##  Prof-specialty : 4140   Husband       :13193   Amer-Indian-Eskimo:  311
##  Craft-repair   : 4099   Not-in-family : 8305   Asian-Pac-Islander: 1039
##  Exec-managerial: 4066   Other-relative:  981   Black             : 3124
##  Adm-clerical   : 3770   Own-child     : 5068   Other             :  271
##  Sales          : 3650   Unmarried     : 3446   White             :27816
##  (Other)        :10993   Wife          : 1568
##  NA's           : 1843
##      sex         capital.gain    capital.loss    hours.per.week
##  Female:10771   Min.   :    0   Min.   :   0.0   Min.   : 1.00
##  Male  :21790   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00
##                 Median :    0   Median :   0.0   Median :40.00
##                 Mean   : 1078   Mean   :  87.3   Mean   :40.44
##                 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00
##
##        native.country  salary.class
##  United-States:29170   <=50K:24720
##  Mexico       :  643   >50K : 7841
##  Philippines  :  198
##  Germany      :  137
##  Canada       :  121
##  (Other)      : 1709
##  NA's         :  583

Using “tidyverse” to Explore & Clean Data

One of the most useful R packages in the majority of data science projects is “tidyverse”. Please note that “tidyverse” is actually a collection of packages: when we load it, we are loading several packages that are available within “tidyverse”. These packages share a consistent design in which each function takes a data frame (or tibble) as its first argument and returns one, so operations can be chained together. This makes it very easy to explore, process, clean, and change data in R. One of these packages, “dplyr”, helps us solve the most common data manipulation challenges very easily. The main verbs in “dplyr” include:

- filter(): pick rows matching given conditions
- select(): pick columns by name
- arrange(): reorder rows
- mutate(): add new columns as functions of existing ones
- summarise(): collapse many rows into summary values
- group_by(): split the data into groups so the verbs above operate per group

Below, we use these verbs to explore the data and perform common data wrangling tasks. But first, we use the function glimpse() to get a quick glimpse of the data. The output of this function is similar to str():

# Load tidyverse
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.2
## -- Attaching packages --------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'forcats' was built under R version 3.6.2
## -- Conflicts ------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Check the structure of salary, the dplyr way
glimpse(salary)
## Observations: 32,561
## Variables: 15
## $ age             <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23,...
## $ workclass       <fct> State-gov, Self-emp-not-inc, Private, Private, Priv...
## $ fnlwgt          <int> 77516, 83311, 215646, 234721, 338409, 284582, 16018...
## $ education       <fct> Bachelors, Bachelors, HS-grad, 11th, Bachelors, Mas...
## $ education.num   <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12,...
## $ martital.status <fct> Never-married, Married-civ-spouse, Divorced, Marrie...
## $ occupation      <fct> Adm-clerical, Exec-managerial, Handlers-cleaners, H...
## $ relationship    <fct> Not-in-family, Husband, Not-in-family, Husband, Wif...
## $ race            <fct> White, White, White, Black, Black, White, Black, Wh...
## $ sex             <fct> Male, Male, Male, Male, Female, Female, Female, Mal...
## $ capital.gain    <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0,...
## $ capital.loss    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hours.per.week  <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30,...
## $ native.country  <fct> United-States, United-States, United-States, United...
## $ salary.class    <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, >5...

Before working on the examples, let’s just use “tidyverse” to remove all the observations with any missing values from our data:

# Removing entries with missing/ blank values from data:
salary <- salary %>%
  drop_na()

Piping data aka %>%

Pipes take the output from one function and feed it to the first argument of the next function. You may have encountered the Unix pipe | before. The R language allows symbols wrapped in % to be defined as infix functions, and the > suggests the direction of the chain. The pipe %>% comes from the magrittr R package (and is re-exported by “dplyr”). In the example above, salary is piped into the function drop_na().
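As a minimal illustration on a toy vector (assuming “dplyr”, which re-exports %>%, is loaded):

```r
library(dplyr)  # provides %>% (re-exported from magrittr)

x <- c(1, 4, 9, 16)

# These two calls are equivalent: the pipe feeds x into sqrt()'s first argument
sqrt(x)
x %>% sqrt()

# Pipes shine when chaining: read left to right instead of inside out
x %>% sqrt() %>% sum()  # same as sum(sqrt(x))
```

Any function whose first argument is the data can be dropped into such a chain.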

Filtering Rows with filter()

filter() allows us to select a subset of rows in a data frame. To use “dplyr” functions, we first mention the name of the data (a data frame or a tibble object), followed by %>%, and then filter().

We add the conditions for filtering inside the function filter(). Here is an example to filter the rows based on education value “Masters”:

salary %>%
  filter(education == "Masters")

If there are multiple conditions for filtering, we separate them using “,”; the conditions are combined with a logical AND. Below, we combine filters on education and salary.class to look at observations with a Masters degree earning more than 50K:

salary %>%
  filter(education == "Masters", salary.class == ">50K")

Selecting Columns with select()

In many cases, the dataset contains more columns than needed. On other occasions, we create new columns based on the original ones and no longer need the originals. select() allows us to quickly zoom in on a useful subset of the data:

salary %>%
  select(age, workclass, education, hours.per.week, salary.class)

We can combine select() and filter(), separated by %>%. In the following example, we select only the columns age, workclass, education, hours.per.week, and salary.class. We also filter the rows to include only observations with a value greater than 50 for the column age:

salary %>%
  select(age, workclass, education, hours.per.week, salary.class) %>%
  filter(age > 50)

There are a number of helper functions we can use with select(), like starts_with(), ends_with(), matches(), and contains(). These functions let us quickly match larger blocks of variables that meet some criterion. See ?select for more details. In the following example, we only select columns that start with “edu”:

salary %>%
  select(starts_with("edu"))

Here, we only select columns that contain “class”:

salary %>%
  select(contains("class"))

Please note that select() can be used to remove variables too. For instance, we can remove variables like this:

salary <- salary %>%
  select(-c(variable1ToRemove, variable2ToRemove))

Arranging Rows with arrange()

arrange() works similarly to filter() except that instead of filtering rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. In the following example, we arrange the data salary based on columns age and education.num:

salary %>%
  arrange(age, education.num)

We can apply arrange in combination with other functions:

salary %>%
  select(age, workclass, education.num, hours.per.week, salary.class) %>%
  filter(age > 50) %>%
  arrange(age, education.num)

We can use desc() to order a column in descending order:

salary %>%
  select(age, workclass, education.num, hours.per.week, salary.class) %>%
  filter(age > 50) %>%
  arrange(desc(age), education.num)

Add New Columns with mutate()

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate(). In the following example, we create a new column called log_age, which is the log transform of the variable age:

salary %>%
  mutate(log_age = log(age))

If you only want to keep the new variables, use transmute(). The code below shows how we keep log_age and log_edu instead of age and education.num:

salary %>%
  transmute(log_age = log(age), log_edu = log(education.num))

Notice that only the newly created columns are in the output.

We can use mutate_at() to convert data from one type to another. For instance, imagine we want to convert some numeric and character variables to factors. Here is how to do it:

# Convert categorical variables to factors:
salary <- salary %>%
  mutate_at(c("numericVariable1", "numericVariable2", "characterVariable1",
              "characterVariable2"), factor)

Common type conversions that we can use include:

- as.character
- as.numeric
- as.integer
- as.factor
- as.logical
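A quick sketch of these conversions on toy vectors:

```r
x <- c("1", "2", "3")
as.numeric(x)    # character to numeric: 1 2 3
as.integer(x)    # character to integer

f <- factor(c("low", "high", "low"))
as.character(f)  # factor labels back to plain strings

as.logical(c("TRUE", "FALSE"))  # strings to TRUE FALSE
```

One well-known pitfall: calling as.numeric() directly on a factor returns the underlying integer level codes, not the labels; convert to character first, as in as.numeric(as.character(f)).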

Summarise Values with summarise()

summarise() collapses a data frame to a single row:

salary %>%
  summarise(mean(age, na.rm = T))

Note: na.rm = TRUE automatically removes the missing values when performing the operation.
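A toy vector shows why this matters:

```r
v <- c(10, 20, NA, 40)
mean(v)                # NA: a single missing value makes the whole result NA
mean(v, na.rm = TRUE)  # roughly 23.33: the NA is dropped before averaging
```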

salary %>%
  summarise(mean(age, na.rm = T), sd(age, na.rm = T), mean(education.num, na.rm = T),
             sd(education.num, na.rm = T))

Later on, we learn how to use summarise() along with group_by().

Randomly Sample Rows with sample_n() & sample_frac()

In many real data science projects, we need to randomly sample the data (for instance, we usually train and test the models over different samples of the data). We can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction.

salary %>%
  sample_n(16)
salary %>%
  sample_frac(.01)
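As a sketch of the train/test use case mentioned above (using the built-in mtcars data and a hypothetical row_id helper column):

```r
library(dplyr)

set.seed(42)  # fix the random seed so the split is reproducible

# Tag each row, then draw roughly 80% of them for training
train <- mtcars %>%
  mutate(row_id = row_number()) %>%
  sample_frac(0.8)

# The test set is every tagged row that did not land in train
test <- mtcars %>%
  mutate(row_id = row_number()) %>%
  anti_join(train, by = "row_id")
```

anti_join() keeps only the rows whose row_id is absent from the training sample, so train and test never overlap.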

Joining Datasets Using inner_join and Other Similar Commands

It is very common to join two or more datasets to create a new dataset in a data science project. For instance, we could have two datasets each recorded in a different database. For building our model though, we may need to use the variables (columns) from both of those two datasets. To join datasets together, we need to find a unique identifier (key) in both datasets. For instance, student_id can be used to join a dataset about students’ academic performance to a dataset related to students’ financial aid. Once we have the key, we can use the following format to join the datasets:

new_data <- dataset1 %>%
  inner_join(dataset2, by = key)

Now, let’s create two datasets from our salary data and then join them back together to recreate the original salary data. First, we create a unique identifier, which we later use as the key to rejoin the datasets:

salary <- salary %>%
  mutate(id = row_number())

head(salary)

We then create the two datasets:

salary1 <- salary %>%
  select(id,age,workclass,fnlwgt,education,education.num,martital.status,occupation)

head(salary1)
salary2 <- salary %>%
  select(id,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,salary.class)

head(salary2)

Now we join them again to create the original data:

salary <- salary1 %>%
  inner_join(salary2, by = "id")

head(salary)

Below, we can learn more about the different join types, including filtering joins and nesting joins (from https://dplyr.tidyverse.org/reference/join.html):

Join types:

Currently dplyr supports four types of mutating joins, two types of filtering joins, and a nesting join.

Mutating joins combine variables from the two data.frames:

- inner_join(): returns all rows from x where there are matching values in y, and all columns from x and y
- left_join(): returns all rows from x, and all columns from x and y
- right_join(): returns all rows from y, and all columns from x and y
- full_join(): returns all rows and all columns from both x and y

Filtering joins keep cases from the left-hand data.frame:

- semi_join(): returns all rows from x where there are matching values in y, keeping just the columns from x
- anti_join(): returns all rows from x where there are no matching values in y, keeping just the columns from x

Nesting joins create a list column of data.frames:

- nest_join(): returns all rows and columns from x, with an extra list column of data.frames containing the matching rows from y
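The differences are easiest to see on small hypothetical tables (students and grades are made-up names sharing an id key):

```r
library(dplyr)

students <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"),
                       stringsAsFactors = FALSE)
grades   <- data.frame(id = c(2, 3, 4), grade = c("A", "B", "C"),
                       stringsAsFactors = FALSE)

inner_join(students, grades, by = "id")  # ids 2 and 3: matches only
left_join(students, grades, by = "id")   # all students; NA grade for id 1
semi_join(students, grades, by = "id")   # students with a grade, students' columns only
anti_join(students, grades, by = "id")   # students without a grade
```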

Grouped Operations in “dplyr”

The “dplyr” functions are useful on their own, but they become even more powerful when we apply them to groups of observations within a dataset. In “dplyr”, we do this with the group_by() function. It breaks a dataset down into specified groups of rows. When we then apply the other functions above to the resulting object, they are automatically applied by group.

Grouping affects the functions as follows:

- select(): works as for ungrouped data, except that the grouping variables are always retained
- arrange(): unchanged, unless you set .by_group = TRUE, in which case rows are ordered first by the grouping variables
- filter() and mutate(): computed within each group, which is most useful together with window functions
- summarise(): computes one row of summary statistics per group
- sample_n() and sample_frac(): sample the given number/fraction of rows within each group

In the following example, we split the dataset based on education into groups and then summarize each group by counting the number of observations (count = n()) and computing the average age (mean_age = mean(age, na.rm = TRUE)) in each group:

salary %>%
  group_by(education) %>%
  summarise(count = n(), mean_age = mean(age, na.rm = TRUE))

We can also group observations based on multiple columns:

salary %>%
  group_by(education, sex) %>%
  summarise(count = n(), mean_age = mean(age, na.rm = TRUE))

There are many other operations we can do using “dplyr”. Please refer to our references (https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html & https://dplyr.tidyverse.org) to learn more.

Using “tidyr” to Explore & Clean Data

“tidyr” is another package within “tidyverse”. Here, we mainly use its gather(), spread(), unite(), and separate() functions to further work with the data:

#Load dairy data
dairy_clean <- read.csv("dairy_clean.csv", strip.white = T, na.strings = "")

head(dairy_clean)

Gathering Columns into “key-value” Pairs

# Apply gather() to dairy_clean and save the result as dairy_long
dairy_long <- gather(dairy_clean, YEAR, MILK_val, -FARM)

# View the first 20 rows of the result
head(dairy_long,20)
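Since the dairy file itself is not displayed here, a toy sketch of what gather() does (wide is a made-up table with one column per year):

```r
library(tidyr)

wide <- data.frame(FARM  = c("A", "B"),
                   Y2019 = c(10, 20),
                   Y2020 = c(30, 40),
                   stringsAsFactors = FALSE)

# Stack every column except FARM into YEAR (key) / MILK_val (value) pairs
long <- gather(wide, YEAR, MILK_val, -FARM)
long  # 4 rows: one per FARM-YEAR combination
```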

Spreading “key-value” Pairs

# Apply spread() to dairy_long
dairy_wide <- spread(dairy_long, YEAR, MILK_val)

# View the head of dairy_wide
head(dairy_wide)

Uniting Columns

# Apply unite() to dairy
dairy <- read.csv("dairy.csv", strip.white = T, na.strings = "")
dairy_unite <- unite(dairy, FARM_YEAR, FARM, YEAR, sep = "-")

# View the head of the result
head(dairy_unite)

Separating Columns

# Apply separate() to dairy_unite
dairy_separate <- separate(dairy_unite, col = FARM_YEAR, into = c("FARM", "YEAR"), sep = "-")

# Print the head of the result
head(dairy_separate)
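The same unite()/separate() round trip on a made-up two-column table:

```r
library(tidyr)

df <- data.frame(FARM = c("A", "B"), YEAR = c(2019, 2020),
                 stringsAsFactors = FALSE)

# Glue FARM and YEAR into one key column...
united <- unite(df, FARM_YEAR, FARM, YEAR, sep = "-")
united$FARM_YEAR  # "A-2019" "B-2020"

# ...and split it back into the original two columns
separate(united, col = FARM_YEAR, into = c("FARM", "YEAR"), sep = "-")
```

Note that separate() returns YEAR as a character column; pass convert = TRUE to turn it back into an integer.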

A complete cheat sheet of “tidyverse” is given below:

Tidyverse cheatsheet Part 1

Tidyverse cheatsheet Part 2

Working with Dates

# Load lubridate package
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.6.2
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
##     date
# Parse as date
dmy("17 Sep 2015")
## [1] "2015-09-17"
# Parse as date and time (with no seconds!)
mdy_hm("July 15, 2012 12:56")
## [1] "2012-07-15 12:56:00 UTC"

String Manipulation

Trimming and Padding Strings

#Load stringr package
library(stringr)
# Trim all leading and trailing whitespace
str_trim(c("   Filip ", "Nick  ", " Jonathan"))
## [1] "Filip"    "Nick"     "Jonathan"
# Pad these strings with leading zeros
str_pad(c("23485W", "8823453Q", "994Z"), width=9, side='left', pad="0")
## [1] "00023485W" "08823453Q" "00000994Z"

Upper and Lower Case

states <- c("al", "ak", "az", "ar", "ca", "co", "ct", "de", "fl", "ga",
"hi", "id", "il", "in", "ia", "ks", "ky", "la", "me", "md", "ma",
"mi", "mn", "ms", "mo", "mt", "ne", "nv", "nh", "nj", "nm", "ny",
"nc", "nd", "oh", "ok", "or", "pa", "ri", "sc", "sd", "tn", "tx",
"ut", "vt", "va", "wa", "wv", "wi", "wy")

# Make states all uppercase and save result to states_upper
states_upper <- toupper(states)
states_upper
##  [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA"
## [16] "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ"
## [31] "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
## [46] "VA" "WA" "WV" "WI" "WY"
# Make states_upper all lowercase again
states_lower <- tolower(states_upper)
states_lower
##  [1] "al" "ak" "az" "ar" "ca" "co" "ct" "de" "fl" "ga" "hi" "id" "il" "in" "ia"
## [16] "ks" "ky" "la" "me" "md" "ma" "mi" "mn" "ms" "mo" "mt" "ne" "nv" "nh" "nj"
## [31] "nm" "ny" "nc" "nd" "oh" "ok" "or" "pa" "ri" "sc" "sd" "tn" "tx" "ut" "vt"
## [46] "va" "wa" "wv" "wi" "wy"

Finding and Replacing Strings

# Look at the head of salary
head(salary)
# Detect all native.country with United-States
str_detect(salary$native.country, "United-States")
# In the sex column, replace "Female" with "F"...
salary$sex <- str_replace(salary$sex, "Female", "F")

# ...And "Male" with "M"
salary$sex <- str_replace(salary$sex, "Male", "M")

# View the head of salary
head(salary)

Missing and Special Values

Finding missing values

salary2 <- read.csv("salary_subset.csv", strip.white = T, na.strings = "?")

# Call is.na() on the full salary2 to spot all NAs
is.na(salary2)
# Use the any() function to ask whether there are any NAs in the data
any(is.na(salary2))
## [1] TRUE
# View a summary() of the dataset
summary(salary2)
##       age                   workclass      fnlwgt              education
##  Min.   :19.00                   : 1   Min.   : 28887   HS-grad     :13
##  1st Qu.:30.00   Federal-gov     : 3   1st Qu.:122147   Bachelors   :11
##  Median :38.50   Local-gov       : 2   Median :192038   Some-college: 7
##  Mean   :38.42   Private         :33   Mean   :205162   11th        : 4
##  3rd Qu.:49.00   Self-emp-not-inc: 6   3rd Qu.:269627   Masters     : 4
##  Max.   :59.00   State-gov       : 3   Max.   :544091   9th         : 3
##                                                         (Other)     : 6
##  education.num                martital.status             occupation
##  Min.   : 4.00   Divorced             : 6     Exec-managerial  : 9
##  1st Qu.: 9.00   Married-AF-spouse    : 1     Prof-specialty   : 7
##  Median :10.00   Married-civ-spouse   :26     Adm-clerical     : 5
##  Mean   :10.33   Married-spouse-absent: 1     Machine-op-inspct: 5
##  3rd Qu.:13.00   Never-married        :12     Sales            : 4
##  Max.   :16.00   Separated            : 2     (Other)          :17
##                                               NA's             : 1
##         relationship                 race        sex      capital.gain
##  Husband      :23    Amer-Indian-Eskimo: 1   Female:11   Min.   :    0.0
##  Not-in-family: 8    Asian-Pac-Islander: 3   Male  :37   1st Qu.:    0.0
##  Own-child    : 7    Black             :10               Median :    0.0
##  Unmarried    : 7    White             :34               Mean   :  446.6
##  Wife         : 3                                        3rd Qu.:    0.0
##                                                          Max.   :14084.0
##
##   capital.loss     hours.per.week        native.country salary.class
##  Min.   :   0.00   Min.   :13.00                : 6     <=50K:36
##  1st Qu.:   0.00   1st Qu.:40.00   Cuba         : 1     >50K :12
##  Median :   0.00   Median :40.00   India        : 1
##  Mean   :  71.88   Mean   :41.23   Jamaica      : 1
##  3rd Qu.:   0.00   3rd Qu.:44.25   Mexico       : 1
##  Max.   :2042.00   Max.   :80.00   Puerto-Rico  : 1
##                                    United-States:37
# Call table() on the native.country column
table(salary2$native.country)
##
##                        Cuba         India       Jamaica        Mexico
##             6             1             1             1             1
##   Puerto-Rico United-States
##             1            37

Dealing with missing values

# Replace all empty strings in native.country with NA
salary2$native.country[salary2$native.country == ""] <- NA

# Print salary2 to the console
salary2
# Use complete.cases() to see which rows have no missing values
complete.cases(salary2)
##  [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [13] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [37]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
# Use na.omit() to remove all rows with any missing values
na.omit(salary2)

Outliers & Obvious Errors

Dealing with outliers and obvious errors

# Look at a summary() of salary
summary(salary)
##        id             age                   workclass         fnlwgt
##  Min.   :    1   Min.   :17.00   Private         :22286   Min.   :  13769
##  1st Qu.: 7541   1st Qu.:28.00   Self-emp-not-inc: 2499   1st Qu.: 117627
##  Median :15082   Median :37.00   Local-gov       : 2067   Median : 178425
##  Mean   :15082   Mean   :38.44   State-gov       : 1279   Mean   : 189794
##  3rd Qu.:22622   3rd Qu.:47.00   Self-emp-inc    : 1074   3rd Qu.: 237629
##  Max.   :30162   Max.   :90.00   Federal-gov     :  943   Max.   :1484705
##                                  (Other)         :   14
##         education    education.num                martital.status
##  HS-grad     :9840   Min.   : 1.00   Divorced             : 4214
##  Some-college:6678   1st Qu.: 9.00   Married-AF-spouse    :   21
##  Bachelors   :5044   Median :10.00   Married-civ-spouse   :14065
##  Masters     :1627   Mean   :10.12   Married-spouse-absent:  370
##  Assoc-voc   :1307   3rd Qu.:13.00   Never-married        : 9726
##  11th        :1048   Max.   :16.00   Separated            :  939
##  (Other)     :4618                   Widowed              :  827
##            occupation           relationship                   race
##  Prof-specialty :4038   Husband       :12463   Amer-Indian-Eskimo:  286
##  Craft-repair   :4030   Not-in-family : 7726   Asian-Pac-Islander:  895
##  Exec-managerial:3992   Other-relative:  889   Black             : 2817
##  Adm-clerical   :3721   Own-child     : 4466   Other             :  231
##  Sales          :3584   Unmarried     : 3212   White             :25933
##  Other-service  :3212   Wife          : 1406
##  (Other)        :7585
##      sex             capital.gain    capital.loss     hours.per.week
##  Length:30162       Min.   :    0   Min.   :   0.00   Min.   : 1.00
##  Class :character   1st Qu.:    0   1st Qu.:   0.00   1st Qu.:40.00
##  Mode  :character   Median :    0   Median :   0.00   Median :40.00
##                     Mean   : 1092   Mean   :  88.37   Mean   :40.93
##                     3rd Qu.:    0   3rd Qu.:   0.00   3rd Qu.:45.00
##                     Max.   :99999   Max.   :4356.00   Max.   :99.00
##
##        native.country  salary.class
##  United-States:27504   <=50K:22654
##  Mexico       :  610   >50K : 7508
##  Philippines  :  188
##  Germany      :  128
##  Puerto-Rico  :  109
##  Canada       :  107
##  (Other)      : 1516
# View a histogram of the age variable
hist(salary$age)

# View a histogram of the hours.per.week
hist(salary$hours.per.week)

Another look at strange values

# View a boxplot of age
boxplot(salary$age)

# View a boxplot of hours.per.week
boxplot(salary$hours.per.week)

We can also create a boxplot to compare the distribution of a numeric variable across the levels of a categorical variable:

# View a boxplot of hours.per.week by sex
boxplot(salary$hours.per.week ~ salary$sex)

Summary of Other Useful Functions for Exploring Data

Below is a collection of useful functions that help us navigate the data in R:

head(data_csv, n=4) # Returns the first 4 rows of the data frame data_csv
View(data_csv)  # Opens the data frame data_csv in a data viewer
colnames(data_csv)  # Returns the column names in data_csv
nrow(data_csv)  # Returns the number of rows in data_csv
ncol(data_csv)  # Returns the number of columns in data_csv
dim(data_csv)   # Returns the dimensions of data_csv
dim(data_csv)[2]    # Returns the number of columns in data_csv
typeof(data_csv$age)    # Returns the type of variable “age”
length(data_csv$age) # Returns the number of elements in the vector “age”
mean(data_csv$age)  # Returns the mean of variable “age”
sd(data_csv$age)    # Returns the standard deviation of variable “age”
summary(data_csv$age)   # Returns the summary of variable “age”
hist(data_csv$age)  # Returns the histogram of variable “age”
unique(data_csv$age) # Returns all unique values of variable "age"
sum(is.na(data_csv$workclass)) # Returns the number of missing values in variable "workclass"